A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators
Authors
Abstract
The shift toward parallel computing has resulted in a growing interest in computing systems with heterogeneous processing modules. Reconfigurable devices are often employed in such heterogeneous systems due to their low power and parallel processing benefits. An important issue in the programmability of these systems is the need for a single programming interface. Recent works have leveraged parallel programming models in tandem with high-level synthesis (HLS) to facilitate high-abstraction parallel programming of FPGAs. Nevertheless, generating efficient custom hardware accelerators depends on the structure of the parallel input code. Code optimized for programmable multicore devices (e.g., GPUs or CPUs) may result in low-performance custom accelerators. In this work we describe a code optimization framework which analyzes and restructures CUDA kernels that were optimized for GPU devices in order to facilitate synthesis of efficient custom accelerators on FPGAs. Our experimental results show that the proposed framework can achieve good performance portability.
Related Resources
Automatic Generation of Optimized OpenCL Codes Using OCLoptimizer
The eruption of multicore processors and several kinds of accelerators has generalized the interest in parallel programming. The OpenCL standard is very appealing because it provides code portability across most of these platforms. It defines a programming model where a host code requests the execution of kernels in computational devices. Unfortunately, the host API of OpenCL is quite verbose, ...
Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators. (Transformations de programme automatiques et source-à-source pour accélérateurs matériels de type GPU)
Since the beginning of the 2000s, the raw performance of processors stopped its exponential increase. The modern graphic processing units (GPUs) have been designed as arrays of hundreds or thousands of compute units. The GPUs' compute capacity quickly leads them to be diverted from their original target to be used as accelerators for general purpose computation. However, programming a GPU efficient...
Developing a High Performance GPGPU Compiler Using Cetus
In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is functionally correct but without any consideration for performance optimization. Our compiler applies a set of optimization techniques to the na...
Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past as accelerators could usually be programmed using specific programming languages threatening maintainability, portability and correctness. Several new programmi...
Trellis: Portability across architectures with a high-level framework
The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level framewor...